The data presented in this report are part of a study aimed to assess differential gene expression and methylation in vaping versus non-vaping LatinX youths in Pueblo and Denver, CO. Pulmonary function data were also obtained in order to better understand the impacts of vape use on pulmonary function. To assess differential gene expression and methylation, naso-epithelial swabs were obtained from each participating subject. Pulmonary function is assessed using PFTs (Pulmonary Function Tests) and Impulse Oscillometry (IOS).
This data set consists of samples taken from 51 people ages 12-17 yrs from the Pueblo, Denver, and Aurora, CO areas. Vape Status (did you vape in the last 6 months?) and ethnicity are self-reported.
All analyses performed using R version 4.2.1 (2022-06-23)
Subjects are dichotomized to those that used a vaping device in the last 6 months and those who have not based on the variables ever_vape, vape_days, and last_vape. This variable will be referred to as Vape Status throughout this report. One participant (SID = 111) reported that they had used a vaping device 5 out of the last 30 days, but did not respond to last_vape. That participant is recorded as Vaped in this analysis.
Biological sex has been derived from methylation data utilizing the getSex function from the R package minfi 1.43.0.
Subjects’ geographic location, city, was grouped into the new broader variable recruiting_center which encompasses the broader geographic region where they live.
Measures of lung function and IOS were visually inspected for normality using histograms.
Annotation: Ensembl annotation for GrCH 39 ver. 37
Removal of Unwanted Variance: edgeR 3.39.6 and RUVSeq
1.31.0
Differential Expression Analysis: DESeq2 1.37.5
Gene Filtering Parameters
This analysis will conduct a comparison of various gene-filtering
parameters presented in previous analyses and in the current literature
to select parameters best-suited for this study.
Normalization
The following analyses used the function RUVr from the R package RUVSeq.
RUVr uses the deviance residuals from a first pass negative binomial GLM
to perform a factor analysis which corrects for unwanted technical
effects. The first-pass model formula is presented below :
\[raw \ read \ count \sim \beta_0 + \beta_1 * vape \ status \ + \beta_2 *male \ + \beta_3 * age\]
RUVr will be performed with k = 1 through k = 2 factors and the best k for factor analysis will be determined visually using an elbow plot, RLE plots, and dendrograms. Previous analyses used R package DESeq2 to fit the first-pass GLM. This analysis will use edgeR due to its reference in the literature for the RUVr procedure mentioned above.
Transformations
A variance stabilizing transformation (VST) was applied in previous
analyses. This analysis will use the same transformation, but will apply
the transformation only after normalization.
After removing participants with missing values for vape status, we are left with n = 50 subjects. Previous analyses showed n = 12 participants had vaped in the last 6 months. This analysis will use n = 13 participants who had vaped in the last 6 months. The lung function variable fev1 and fev1_fvc reported 22 missing values. IOS measures r5 and x20 reported 1 and 6 missing values, respectively.
#create table 1
vape_6mo_table1 <- tab1_dat %>%
tableby(vape_6mo_lab ~ sex_lab + age + recruitment_center +
latino_lab + fev1 + fev1_fvc +
r5 + x20, data = ., digits = 1, test = FALSE )
#Fix Labels
arsenal::labels(vape_6mo_table1) <- c(vape_6mo_lab = "Vaped in last 6 months",
age = "Age (yrs)", sex_lab = "Sex",
recruitment_center = "Recruitment Center",
latino_lab = "Ethnicity", fev1 = 'FEV1',
fev1_fvc = "FEV1/FVC (%)", r5 = 'R5', x20 = 'X20')
p_male <- (sum(tab1_dat$sex_lab == 'Male')/length(tab1_dat$sex_lab))*100
#Print Tables
summary(vape_6mo_table1, pfootnote = T)| Did Not Vape in Last 6 Months (N=37) | Vaped in Last 6 Months (N=13) | Total (N=50) | |
|---|---|---|---|
| Sex | |||
| Female | 17 (45.9%) | 8 (61.5%) | 25 (50.0%) |
| Male | 20 (54.1%) | 5 (38.5%) | 25 (50.0%) |
| Age (yrs) | |||
| Mean (SD) | 14.6 (1.4) | 14.8 (1.4) | 14.6 (1.4) |
| Range | 12.0 - 17.0 | 13.0 - 17.0 | 12.0 - 17.0 |
| Recruitment Center | |||
| Aurora | 15 (40.5%) | 0 (0.0%) | 15 (30.0%) |
| CommCity/Denver | 13 (35.1%) | 1 (7.7%) | 14 (28.0%) |
| Pueblo | 9 (24.3%) | 12 (92.3%) | 21 (42.0%) |
| Ethnicity | |||
| LatinX | 23 (62.2%) | 11 (84.6%) | 34 (68.0%) |
| Non-LatinX | 14 (37.8%) | 2 (15.4%) | 16 (32.0%) |
| FEV1 | |||
| N-Miss | 10 | 12 | 22 |
| Mean (SD) | 2.6 (0.7) | 3.9 (NA) | 2.6 (0.7) |
| Range | 1.2 - 3.9 | 3.9 - 3.9 | 1.2 - 3.9 |
| FEV1/FVC (%) | |||
| N-Miss | 10 | 12 | 22 |
| Mean (SD) | 0.8 (0.1) | 0.7 (NA) | 0.8 (0.1) |
| Range | 0.5 - 1.0 | 0.7 - 0.7 | 0.5 - 1.0 |
| R5 | |||
| N-Miss | 1 | 0 | 1 |
| Mean (SD) | 4.0 (0.9) | 5.0 (1.3) | 4.3 (1.1) |
| Range | 2.0 - 6.1 | 3.7 - 7.6 | 2.0 - 7.6 |
| X20 | |||
| N-Miss | 4 | 2 | 6 |
| Mean (SD) | 0.1 (0.6) | 0.7 (0.9) | 0.2 (0.7) |
| Range | -1.1 - 2.4 | -1.0 - 2.3 | -1.1 - 2.4 |
#Fix sex Bar Plot
sex_plot_trial <- tab1_dat %>%
group_by(sex_lab, vape_6mo_lab) %>%
summarise(N = n())
#By Recruitment sex (n = 50)
bar_sex <- sex_plot_trial %>%
ggplot(aes(x = sex_lab, y = N, fill = vape_6mo_lab)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(y = "Count", fill = "Vape Status") +
ggtitle("Sex") +
scale_fill_manual(values = brewer.pal(3, "Set2"))+
theme(axis.title.x=element_blank())
#By Latino(n = 50)
bar_latino <- tab1_dat %>%
ggplot(aes(x = latino_lab, fill = vape_6mo_lab))+
geom_bar(position = "dodge")+
labs( y = "Count", fill = "Vape Status") +
ggtitle("LatinX") +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.title.x=element_blank())
#Fix Recruitment Center Bar Plot
center_plot_trial <- tab1_dat %>%
group_by(recruitment_center, vape_6mo_lab) %>%
summarise(N = n())
center_plot_trial[nrow(center_plot_trial) + 1,] <- NA
center_plot_trial[6,] <- list("Aurora", "Vaped in Last 6 Months", 0)
#By Recruitment Center
recruitment_center_hist <- center_plot_trial %>%
ggplot(aes(x = recruitment_center, y = N, fill = vape_6mo_lab)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(y = "Count", fill = "Vape Status") +
ggtitle("Center")+
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.title.x=element_blank())
#By FEV1/FVC Continuous
hist_fev1_fvc <- tab1_dat %>%
ggplot(aes(x = fev1_fvc, fill = vape_6mo_lab))+
geom_histogram(binwidth = 0.1, breaks = seq(0.5,1,0.1), position = "dodge")+
labs(x = TeX("\\frac{FEV1}{FVC}"), y = "Count", fill = "Vape Status:") +
xlim(0.5,1) +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle("FEV1/FVC")
#By FEV1/FVC Box
box_fev1_fvc <- tab1_dat %>%
ggplot(aes(x = vape_6mo_lab, y = fev1_fvc, fill = vape_6mo_lab)) +
geom_boxplot(show.legend = F) +
labs(x = "Vape Status", y = TeX("\\frac{FEV1}{FVC}")) +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle(TeX("\\frac{FEV1}{FVC} \ by Vape Status (n = 28)"))
#By R5(hist)
r5_hist <- tab1_dat %>%
ggplot(aes(x = r5, fill = vape_6mo_lab)) +
geom_histogram(binwidth = 1, breaks = seq(1,8,1), position = 'dodge') +
labs(x = "R5", y = "Count", fill = "Vape Status") +
xlim(1,8) +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle("R5")
#by R5(box)
r5_box <- tab1_dat %>%
ggplot(aes(x = vape_6mo_lab, y = r5, fill = vape_6mo_lab)) +
geom_boxplot(show.legend = F) +
labs(x = "Vape Status", y = "R5") +
ggtitle("R5 by Vape Status (n = 49)") +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
#by X20(box)
x20_box <- tab1_dat %>%
ggplot(aes(x = vape_6mo_lab, y = x20, fill = vape_6mo_lab)) +
geom_boxplot(show.legend = F) +
labs(x = "Vape Status", y = "X20") +
ggtitle("X20 by Vape Status (n = 44)") +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
#by x20(his)
x20_hist <- tab1_dat %>%
ggplot(aes(x = x20, fill = vape_6mo_lab)) +
geom_histogram(bins = 10, breaks = seq(-2,3,1), position = "dodge") +
labs(x = 'X20', y = "Count", fill = 'Vape Status') +
scale_fill_manual(values = brewer.pal(3, "Set2")) +
ggtitle("X20")Figure 1 visualizes the demographic information presented in Table 1. The majority of vaping subjects were recruited in Pueblo (92%) and identified as LatinX (85%). 50% of subjects were male as confirmed from the available methylation data.
Figure 2 is a visualization of the pulmonary function (\(\frac{FEV1}{FVC}\)) and IOS (R5 and X20) variables. \(\frac{FEV1}{FVC}\) was only completed by n = 22 individuals from the study population. R5 and X20 represent n = 49 and n = 44 individuals, respectively.
#Respiratory Figures
ggarrange(hist_fev1_fvc, ggarrange(r5_box, x20_box, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = T, legend = "bottom")Figure 3 identifies correlation patterns in the clinical variables of
interest. To test for correlation, a contingency table was made for each
comparison of variables. because some contingency tables contained cells
with <5 subjects, a Fisher’s Exact test was used for all tests.
From the correlation matrix, it appears that there may be significant correlation between vape status and recruitment center. A separate t-test was run to determine independence between vape status and age.
| Test Variable | Group 1 | Group 2 | Mean Group 1 | Mean Group 2 | t | df | p-value |
|---|---|---|---|---|---|---|---|
| Age | Did Not Vape in Last 6 Months | Vaped in last 6 months | 14.6 | 14.9 | -0.7 | 16.6 | 0.5 |
After merging clinical metadata with gene-count data, there were n = 47 samples present. From the clinical metadata, SID = 105, 137, and 103 are missing genetic data. Sample 23 (SID = 144) was also removed due to missing vape status.
The table below compares gene-filtering parameters from previous analyses to a parameter presented by the creators of the RUVseq package.
source(here("Code/02_gene_filter_comparison.R"), local = knitr::knit_global())
filter_compar| Filter | Incluison Criteria | Gene Count Before | Gene Count After | Genes Removed |
|---|---|---|---|---|
| 1 | At least 25% of the samples have > 0 reads | 60,651 | 31,505 | 29,146 |
| 2 | The range of reads across all samples < 100 | 31,505 | 16,860 | 14,645 |
| 3 | >5 reads in at least 2 samples (Bioconductor) | 60,651 | 29,141 | 31,510 |
The figure below is used to visually assess how many genes are removed across cutoff values for the range of reads for each gene across the 47 samples after filtering out genes that have 0 read counts in 75% or more of the samples. The red point represents the number of genes (14,645) removed at the cutoff range of 100 (the value used in previous analyses).
After reviewing the comparison of filters, it appears that filtering parameters presented by the creators of RUVSeq may be overly conservative for this application. Analyses will proceed using the same filters as previous analyses (filters 1 & 2 from table 2). The first filter will remove genes with 0 reads in more than 75% of samples, and the second will remove genes with low variation (range of reads < 100) that may be considered “house-keeping” genes. This analysis will include a total of n = 16,860 genes.
The following figure shows the Relative Log Expression (RLE) and Principal Component Analysis (PCA) of read counts for each sample without any prior transformations.
It appears that Sample 12 (SID = 102) may be an outlier. Below are the RLE and PCA plots with the sample removed.
To further assess the removal of sample 12, the following is an Elbow Plot which displays the reduction in within-cluster sums of squares when increasing the number of clusters (k) both with and without sample 12.
To visually inspect for the best cutoff for factor analysis, RUVr was run for k = 1 and k = 2 factors both with and without Sample 12 included in the dataset. The RLE and PCA plots for each are presented below.
As was discussed previously, it appears that Sample 12 may be an
outlier. Below are the RLE plots after RUVr was conducted with k = 1 and
k = 2 with Sample 12 removed from the dataset.
To help additionally compare, the following are side-by-side PCA plots
comparing samples with and without sample 12 removed before RUVr, after
RUVr with k = 1, and after RUVr with k = 2.
The dendrograms for k = 1 and k = 2 are compared below with and without Sample 12 included.
Using the figures above (with special attention towards Figure 10), It seems appropriate to keep sample 12 and use k = 2 factors for RUVr normalization.